This dataset contains data on 1000 randomly chosen students, described by a few characteristics. The objective of this project is to demonstrate some feature engineering techniques to apply before machine learning.

The dataset is very simple: most of the variables are binary, so it leaves limited room for other kinds of techniques. It still gives an idea of how to work with a dataset that offers few distinct variables.

gender : Student's gender (male/female)

race/ethnicity : Group the student belongs to

parental level of education : Education level reached by the parents

lunch : Type of lunch

test preparation course : Whether the student took a preparation course (yes/no)

math score : Math score

reading score : Reading score

writing score : Writing score

To make the columns easier to use in code, rename them to a consistent naming style.

snake_case

The style is called "snake" case because the name stays flat on the ground, like a snake: the letters are always lowercase, and the words that make up the name are separated by underscores, like this: my_name_is.
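The renaming step can be sketched as follows. This is a minimal illustration, not the original notebook code: the two sample rows and the helper name `to_snake_case` are made up, and only the column names come from the list above.

```python
import pandas as pd

# Hypothetical two-row sample; the cell values are invented for illustration.
df = pd.DataFrame({
    "gender": ["female", "male"],
    "race/ethnicity": ["group B", "group C"],
    "parental level of education": ["bachelor's degree", "some college"],
    "lunch": ["standard", "free/reduced"],
    "test preparation course": ["none", "completed"],
    "math score": [72, 47],
    "reading score": [72, 57],
    "writing score": [74, 44],
})


def to_snake_case(name: str) -> str:
    """Lowercase the name and replace slashes and spaces with underscores."""
    return name.strip().lower().replace("/", "_").replace(" ", "_")


# rename() accepts a function and applies it to every column label.
df = df.rename(columns=to_snake_case)
print(list(df.columns))
```

After this step the columns can be accessed as attributes or plain identifiers, for example `df.math_score`.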

Dataset analysis.

Create a new column containing the average of the three scores.
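A minimal sketch of this step, assuming the score columns have already been renamed to snake_case. The column name `average` and the three sample rows are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows; real data would come from the student dataset.
scores = pd.DataFrame({
    "math_score": [72, 47, 90],
    "reading_score": [72, 57, 95],
    "writing_score": [74, 44, 93],
})

score_columns = ["math_score", "reading_score", "writing_score"]

# Row-wise mean of the three exams, rounded to two decimals.
scores["average"] = scores[score_columns].mean(axis=1).round(2)
print(scores)
```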

Create a new column that indicates whether the average is passing (greater than 60) or failing.
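One way to derive this column is shown below. The column name `status` and the labels `pass` and `fail` are assumptions; the threshold follows the rule stated above, so an average of exactly 60 counts as failing.

```python
import pandas as pd

# Hypothetical averages, including the boundary value 60.
averages = pd.DataFrame({"average": [72.67, 49.33, 60.0, 92.67]})

PASS_THRESHOLD = 60

# Strictly greater than the threshold passes; everything else fails.
averages["status"] = (
    averages["average"].gt(PASS_THRESHOLD).map({True: "pass", False: "fail"})
)
print(averages)
```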

Create a grade column according to the US system.
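The grade column could be built with `pd.cut`, as in the sketch below. Only the F range (0 to 60) and the A range (90 to 100) are stated in this document; the cut-offs for D, C and B are assumed here, as is the choice to put boundary values in the lower grade so that F matches the failing rule.

```python
import pandas as pd

# Hypothetical averages covering every bin and the boundary values.
grades = pd.DataFrame({"average": [0, 60, 60.5, 75, 90, 95, 100]})

# Assumed boundaries: F [0, 60], D (60, 70], C (70, 80], B (80, 90], A (90, 100].
bins = [0, 60, 70, 80, 90, 100]
labels = ["F", "D", "C", "B", "A"]

# include_lowest=True keeps an average of exactly 0 inside the F bin.
grades["grade"] = pd.cut(
    grades["average"], bins=bins, labels=labels, include_lowest=True
)
print(grades)
```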

Graph analysis.

This graph counts the students according to whether their average passed or failed; the majority clearly passed.

In the United States system, F is the failing grade and A is the highest passing grade. The pie chart shows that only 4.5% of students obtained an A, while F is the most common grade. In numeric terms, F covers averages from 0 to 60 and A covers averages from 90 to 100.

The box plot shows the distribution of the continuous values, which gives another view of the average column. Most of the averages lie between 44 and 81, with an average of 63, meaning that most students scored close to these values. The most extreme outlier is 9.
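The numbers behind a box plot can also be computed directly. The sketch below uses the common 1.5 × IQR rule to flag outliers, which is an assumption about how the plot was drawn; the ten sample values are invented and do not reproduce the figures quoted above.

```python
import pandas as pd

# Hypothetical averages with one very low value.
averages = pd.Series([9, 45, 55, 60, 63, 66, 70, 75, 81, 90], name="average")

q1 = averages.quantile(0.25)
q3 = averages.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = averages[(averages < lower_fence) | (averages > upper_fence)]

print(f"Q1={q1}, Q3={q3}, fences=({lower_fence}, {upper_fence})")
print(outliers.tolist())
```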

In this table we can see that the most frequent average is 68, followed closely by 69 and 73; these are the values that are repeated the most in the overall average.
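A frequency table like this can be produced with `value_counts`. The sketch assumes the averages were rounded to whole numbers, and the sample values are invented to mirror the ordering described above.

```python
import pandas as pd

# Hypothetical rounded averages: 68 appears most often, then 69, then 73.
averages = pd.Series([68, 69, 68, 73, 68, 69, 73, 55, 68, 69], name="average")

# value_counts() sorts by frequency, most common first.
most_common = averages.value_counts().head(3)
print(most_common)
```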

This histogram shows that most of the highest marks are in the writing test, while the lowest are in math.

Comparison of the scores of each exam, including the outliers: math has the most outliers and reading the fewest. Their averages are 66 for math, 70 for reading and 69 for writing.

We can clearly observe the dispersion in each variable.

A box plot with the 1000 data points overlaid as a scatter.

Create a new DataFrame that contains numeric values instead of categorical ones.
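One simple way to do this is to replace every category with an integer code, as sketched below. The original notebook may have used a different encoder; the sample rows are invented, and the codes follow the alphabetical order of the categories.

```python
import pandas as pd

# Hypothetical sample with three categorical columns and one numeric column.
df = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "lunch": ["standard", "free/reduced", "standard"],
    "test_preparation_course": ["none", "completed", "none"],
    "math_score": [72, 47, 90],
})

# Work on a copy so the original categorical data stays available.
encoded = df.copy()
for column in encoded.select_dtypes(exclude="number").columns:
    # Categories are sorted alphabetically, then numbered from 0.
    encoded[column] = encoded[column].astype("category").cat.codes

print(encoded)
```

Integer codes are enough for binary columns. For columns with more than two unordered categories, one-hot encoding (for example `pd.get_dummies`) avoids implying an order.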

Feature engineering

Mutual information score

The column that you want to predict is compared with each of the other columns to measure how much information they share. All the data must be converted to numeric values first.

This graph shows which columns are most important for the prediction.
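A minimal sketch of the score using scikit-learn's `mutual_info_regression`. The data is synthetic and the choice of `math_score` as the target is an assumption; the target is built to depend strongly on `reading_score`, weakly on `lunch`, and not at all on `gender`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 300

# Synthetic, already-numeric features (binary columns encoded as 0/1).
features = pd.DataFrame({
    "reading_score": rng.integers(30, 100, size=n),
    "lunch": rng.integers(0, 2, size=n),
    "gender": rng.integers(0, 2, size=n),
})

# Hypothetical target standing in for math_score.
math_score = (
    features["reading_score"] + 5 * features["lunch"] + rng.normal(0, 3, size=n)
)

# Mark the binary columns as discrete so the estimator treats them correctly.
discrete_mask = [False, True, True]
mi = mutual_info_regression(
    features, math_score, discrete_features=discrete_mask, random_state=0
)

mi_scores = pd.Series(mi, index=features.columns).sort_values(ascending=False)
print(mi_scores)
```

A higher score means the feature tells more about the target; a score near zero means the two are close to independent.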

The following graphs show whether there is a correlation between two variables.
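Alongside the graphs, pairwise correlations can be checked numerically. The sketch below computes Pearson correlation coefficients with `DataFrame.corr` on invented sample rows.

```python
import pandas as pd

# Hypothetical scores for five students.
scores = pd.DataFrame({
    "math_score": [72, 47, 90, 69, 76],
    "reading_score": [72, 57, 95, 90, 78],
    "writing_score": [74, 44, 93, 88, 75],
})

# Pearson correlation between every pair of columns (values from -1 to 1).
correlations = scores.corr()
print(correlations.round(2))
```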

Comparison of the variables using k-means clustering.
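A minimal k-means sketch with scikit-learn. The number of clusters, the pair of columns and the synthetic data (two well-separated groups of scores) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic scores: 50 students around 50 points, 50 students around 85 points.
low_group = rng.normal(loc=50, scale=4, size=(50, 2))
high_group = rng.normal(loc=85, scale=4, size=(50, 2))
scores = pd.DataFrame(
    np.vstack([low_group, high_group]),
    columns=["math_score", "reading_score"],
)

# Assumed number of clusters; n_init and random_state make the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
scores["cluster"] = kmeans.fit_predict(scores[["math_score", "reading_score"]])

print(scores.groupby("cluster")[["math_score", "reading_score"]].mean())
```

The cluster label can then be used as a color in a scatter plot, or as a new feature.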

Mutual information regression, together with variance, to identify the variables that are related to each other.
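The text does not say how variance was used. One plausible reading, sketched below as an assumption, is to drop columns with no variance using `VarianceThreshold` and then rank the remaining columns with `mutual_info_regression`. The data, the column `constant_flag` and the target are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, mutual_info_regression

rng = np.random.default_rng(1)
n = 300

# Synthetic features; constant_flag never changes, so it carries no information.
features = pd.DataFrame({
    "reading_score": rng.integers(30, 100, size=n).astype(float),
    "writing_score": rng.integers(30, 100, size=n).astype(float),
    "constant_flag": np.zeros(n),
})

# Hypothetical target that depends on both score columns.
target = (
    0.5 * features["reading_score"]
    + 0.5 * features["writing_score"]
    + rng.normal(0, 2, size=n)
)

# Step 1: remove columns whose variance is zero.
selector = VarianceThreshold(threshold=0.0)
selector.fit(features)
kept_columns = features.columns[selector.get_support()]

# Step 2: rank the remaining columns by mutual information with the target.
mi = mutual_info_regression(features[kept_columns], target, random_state=0)
ranking = pd.Series(mi, index=kept_columns).sort_values(ascending=False)
print(ranking)
```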